General Overview
At the outset, I downloaded the provided dataset, codebook, and script into a single folder labeled “Quantitative Methods”. I then set this folder as R's working directory and read the data into R, so that the subsequent statistical commands would operate on our specific data.
From the codebook, I chose to analyze the variables ‘gays’ and ‘cfundamentals’. For each variable, I ran the commands that generate the summary statistics (a five-number summary, the mean, and the standard deviation) as well as a box plot. These summaries gave a clear, general picture of how each variable is distributed. The next step was to examine the relationship between the two variables.
To view the data in a scatter plot, I first installed and loaded the required package (as outlined in the example script) and then used it to generate a scatter plot of ‘gays’ against ‘cfundamentals’. The result was a detailed scatter plot that shows not only the data points but also each variable's box plot along its axis and the least squares line. This image provides a quick, visual summary of the relationship between the two variables.
Also useful in assessing the relationship was the correlation coefficient, or “r”, a value between -1 and 1 (inclusive) that measures the strength and direction of the linear association between two variables. It describes how the two variables move together, not whether one influences the other.
R Script
nes<-read.csv("anes2008.csv") #read data into R
summary(nes$education) #education is a factor variable so it gives
#number of respondents in each category
summary(nes$bigbiz) #bigbiz is a quantitative or numeric
#variable, so it gives stats.
sd(nes$bigbiz, na.rm=TRUE)
boxplot(nes$bigbiz) #boxplot of feelings toward big business
hist(nes$bigbiz) #histogram of feelings toward big business
plot(density(nes$bigbiz, na.rm=TRUE)) #kernel density plot of feelings toward big business
install.packages("wvioplot")
library(wvioplot)
par(mfrow=c(1,2)) #graphical parameters for 1 row, 2 columns
wvioplot(nes$poorppl, col="magenta", names="Poor People") #violin plot
wvioplot(nes$welfareppl, col="cyan", names="People on Welfare") #violin plot
cor(nes$christians, nes$cfundamentals, use="complete.obs") #correlation between feelings
#about 'christians' and 'cfundamentals'
install.packages("car") #install the "car" package
#for fancy scatterplots.
library(car) #load it.
scatterplot(nes$christians, nes$cfundamentals) #make a fancy scatterplot with
#boxplots on the side and lines
#of best fit!
---------------------
#⌘D is a keyboard shortcut, not an R command: it allows you to change the working directory.
#(In this case, it is the Quantitative Methods folder.) From a script, setwd() does the same job.
nes<-read.csv("anes2008.csv") #reads data into R.
summary(nes$gays) #provides statistical summary of quantitative variable 'gays'.
sd(nes$gays, na.rm=TRUE) #calculates standard deviation of variable 'gays'.
boxplot(nes$gays) #generates boxplot of data associated with variable 'gays'.
summary(nes$cfundamentals) #provides statistical summary of quantitative variable 'cfundamentals'.
sd(nes$cfundamentals, na.rm=TRUE) #calculates standard deviation of variable 'cfundamentals'.
boxplot(nes$cfundamentals) #generates boxplot of data associated with variable 'cfundamentals'.
install.packages("car") #installs car package used to generate scatterplot.
library(car) #loads the required package.
scatterplot(nes$gays, nes$cfundamentals) #generates scatterplot of variable 'gays' and
#variable 'cfundamentals'.
cor(nes$gays, nes$cfundamentals, use="complete.obs") #calculates the correlation coefficient
#between variable 'gays' and variable
#'cfundamentals'.
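The script relies on na.rm=TRUE and use="complete.obs" to cope with missing responses. The following is a minimal sketch of what those arguments do; the vectors x and y are hypothetical toy data, not values from anes2008.csv.

```r
# Hypothetical toy data; these are NOT values from anes2008.csv.
x <- c(10, 40, NA, 70, 100)
y <- c(80, NA, 50, 40, 20)

mean(x)                 # NA: one missing value makes the whole result missing
mean(x, na.rm = TRUE)   # 55: missing values are dropped before averaging
sd(x, na.rm = TRUE)     # standard deviation of the four observed values

# use = "complete.obs" keeps only the rows where BOTH variables are observed.
complete <- !is.na(x) & !is.na(y)
sum(complete)           # 3 complete pairs remain

r_builtin <- cor(x, y, use = "complete.obs")
r_manual  <- cor(x[complete], y[complete])
all.equal(r_builtin, r_manual)   # TRUE
```

One consequence worth keeping in mind: the correlation is computed only on cases with both values present, so it may rest on fewer observations than either variable's own summary.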
Numerical Output and Analysis
Variable: ‘gays’
[five number summary]
Min. Q1 Median Q3 Max. NA's
0.00 40.00 50.00 70.00 100.00 281
[mean]
49.74
[standard deviation]
27.93194
Variable: ‘cfundamentals’
[five number summary]
Min. Q1 Median Q3 Max. NA's
0.00 50.00 60.00 70.00 100.00 436
[mean]
59.18
[standard deviation]
23.77798
The ‘five number summary’ outlines where some key values of the data are located. The minimum and maximum are the lowest and highest values of the dataset, respectively. The median is the middle value once the data are sorted, so that half of the observations fall at or below it. The 1st quartile (Q1) is the value below which roughly one quarter of the observations fall, while the 3rd quartile (Q3) is the value below which roughly three quarters fall. The NA's column counts the respondents with no recorded value for the variable; they are left out of every statistic reported here. These values are needed for certain other calculations. For example, the range is calculated by subtracting the minimum value of a dataset from the maximum value, and the interquartile range, or IQR, by subtracting the 1st quartile from the 3rd quartile. The IQR is used when determining whether a dataset contains outliers. The calculations follow:
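As an illustration of these definitions, here is a small sketch in R. The vector ratings is a hypothetical toy example, not data from anes2008.csv. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles printed by summary() on other datasets.

```r
# Hypothetical toy ratings; NOT values from anes2008.csv.
ratings <- c(0, 15, 40, 50, 50, 60, 70, 85, 100)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(ratings)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
five

data_range <- five[["Max"]] - five[["Min"]]   # maximum minus minimum
iqr        <- five[["Q3"]] - five[["Q1"]]     # 3rd quartile minus 1st quartile

# Base R's IQR() gives the same interquartile range for this toy vector.
IQR(ratings)
```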
Variable: ‘gays’
range = maximum-minimum
range = 100-0
range = 100
IQR = Q3 – Q1
IQR = 70-40
IQR = 30
Data point “y” is an outlier if it fails to satisfy the following restriction:
Q1 - 1.5(IQR) ≤ y ≤ Q3 + 1.5(IQR)
40 - 1.5(30) ≤ y ≤ 70 + 1.5(30)
40 - 45 ≤ y ≤ 70 + 45
-5 ≤ y ≤ 115
Every value of ‘gays’ lies between 0 and 100, so none of them is an outlier.
Variable: ‘cfundamentals’
range = maximum-minimum
range = 100-0
range = 100
IQR = Q3 – Q1
IQR = 70–50
IQR = 20
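Data point “y” is an outlier if it fails to satisfy the following restriction:
Q1 - 1.5(IQR) ≤ y ≤ Q3 + 1.5(IQR)
50 - 1.5(20) ≤ y ≤ 70 + 1.5(20)
50 - 30 ≤ y ≤ 70 + 30
20 ≤ y ≤ 100
Values of ‘cfundamentals’ below 20 are therefore outliers; since the minimum is 0, at least one such value exists.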
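The two fence calculations can be reproduced in a few lines of R. This is a sketch that works directly from the quartiles reported above rather than from the raw data; the helper names outlier_fences and is_outlier are my own, and values lying exactly on a fence are treated as non-outliers, which matches how R's boxplot() draws its whiskers.

```r
# Fences for the 1.5 x IQR rule, computed from reported quartiles.
# outlier_fences() and is_outlier() are illustrative helper names.
outlier_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

gays_fences  <- outlier_fences(q1 = 40, q3 = 70)   # lower = -5, upper = 115
cfund_fences <- outlier_fences(q1 = 50, q3 = 70)   # lower = 20, upper = 100

# A value is flagged only if it lies strictly beyond a fence.
is_outlier <- function(y, fences) {
  y < fences[["lower"]] | y > fences[["upper"]]
}

is_outlier(c(0, 100), gays_fences)        # FALSE FALSE
is_outlier(c(0, 20, 100), cfund_fences)   # TRUE FALSE FALSE
```

Because both variables are bounded by 0 and 100, only the lower fence of ‘cfundamentals’ can actually be crossed.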
The mean is calculated by dividing the sum of all values of a variable by n, the total number of data points. The mean is a measure of central tendency, representing the average value of a set of points, although it can be strongly affected by the presence of outliers. For the variable ‘gays’ the mean is 49.74, which is close to the median of 50 and near the midpoint of the 0 to 100 range. For the variable ‘cfundamentals’ the mean is 59.18, about 9.4 points higher, which suggests that values of ‘cfundamentals’ tend to be higher than those of ‘gays’.
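A brief sketch of the calculation, and of the mean's sensitivity to an extreme value, using a hypothetical toy vector (not data from anes2008.csv):

```r
# Hypothetical toy values; NOT from anes2008.csv.
values <- c(40, 50, 50, 60, 70)
n <- length(values)

mean_manual <- sum(values) / n   # 270 / 5 = 54
mean_manual == mean(values)      # TRUE: same as the built-in mean()

# Adding one extreme value pulls the mean far more than the median.
with_extreme <- c(values, 1000)
mean(with_extreme)     # about 211.7
median(with_extreme)   # 55
```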
The standard deviation is calculated by dividing the sum of all squared deviations of a dataset by n – 1, where n is the total number of data points, and then taking the square root of the result. A deviation is the difference between a given data point and the mean of the dataset. The standard deviation represents, in the units of the variable itself, roughly how far the data typically stray from the mean. Both standard deviations are large relative to the 0 to 100 scale (27.93 for ‘gays’ and 23.78 for ‘cfundamentals’), so the values of each variable are widely spread around its mean, with ‘gays’ somewhat more dispersed.
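The calculation can be checked by hand against R's sd(). The sketch below uses a hypothetical toy vector (not data from anes2008.csv); dividing the sum of squared deviations by n - 1 gives the sample variance, and the standard deviation is its square root.

```r
# Hypothetical toy values; NOT from anes2008.csv.
values <- c(40, 50, 50, 60, 70)
n <- length(values)

deviations <- values - mean(values)              # -14 -4 -4 6 16
variance_manual <- sum(deviations^2) / (n - 1)   # 520 / 4 = 130
sd_manual <- sqrt(variance_manual)               # square root of the variance

all.equal(sd_manual, sd(values))                 # TRUE
```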
CORRELATION COEFFICIENT between ‘gays’ and ‘cfundamentals’: r = -0.1076076
The correlation coefficient, r, measures the strength and direction of the linear relationship between two variables. The strongest correlations possible are -1 and 1; as r approaches 0, the linear relationship weakens. Our result of r = -0.1076076 (approximately -0.11) is negative and much closer to 0 than to -1, so it indicates a weak, negative correlation between the variables ‘gays’ and ‘cfundamentals’. This conclusion can be interpreted as follows:
When r is negative, the correlation is said to be negative, which means that as one variable increases, the other tends to decrease. A negative correlation therefore exists between variable ‘gays’ and variable ‘cfundamentals’; however, because the r value is so close to 0, the value of one variable tells us very little about the value of the other, and the result says nothing about whether either variable influences the other.
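To make the definition concrete, the sketch below computes r by hand as the sum of products of standardized values divided by n - 1, and compares it with cor(). The vectors x and y are hypothetical toy data, not values from anes2008.csv. The last lines square the r reported above to show how small a share of the variation a straight line would account for.

```r
# Hypothetical toy data; NOT values from anes2008.csv.
x <- c(10, 30, 50, 70, 90)
y <- c(62, 55, 66, 48, 60)

# Standardize each variable, then average the products using n - 1.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_manual <- sum(zx * zy) / (length(x) - 1)

all.equal(r_manual, cor(x, y))   # TRUE

# r as reported in this write-up; r squared is the share of variance
# in one variable accounted for by a linear fit on the other.
r_reported <- -0.1076076
round(r_reported^2, 4)           # 0.0116, i.e. roughly 1 percent
```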
Box Plots
(Analyses of the box plots are provided later on in the section that contains the scatter plot)
Variable: 'gays'
Variable: 'cfundamentals'
Scatter Plot, Box Plot, and Line of Best Fit
for variable ‘gays’ (x-axis) and variable ‘cfundamentals’ (y-axis)
The scatter plot shows where the data points lie. There is no obvious trend in their locations, which is consistent with the correlation coefficient, r = -0.1076076, being very close to 0.
The box plots provide a visual representation of the spread of each variable and indicate the following: both variables share 0 and 100 as the minimum and maximum; Q1 and the median are greater for ‘cfundamentals’ than for ‘gays’; Q3 appears to be the same for both variables. Lastly, the box plots reveal that the data for the y-axis variable (‘cfundamentals’) contain several outliers while the data for the x-axis variable (‘gays’) contain none.
The line of best fit corresponds with the correlation coefficient in that it summarizes, as well as a straight line can, the relationship between the two variables. We can see that it has a negative slope, in accordance with the negative value of r calculated earlier. We can also see that most of the data points in the scatter plot lie far from the line of best fit, which is consistent with the value of r being so close to 0.
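The link between the least squares line and r can be sketched as follows: the slope of the line equals r multiplied by the ratio of the two standard deviations, so the slope and r always share the same sign. The vectors x and y are hypothetical toy data, not values from anes2008.csv, and lm() is used here only as a stand-in for the line that the scatter plot draws.

```r
# Hypothetical toy data; NOT values from anes2008.csv.
x <- c(10, 30, 50, 70, 90)
y <- c(62, 55, 66, 48, 60)

# Least squares line of y on x.
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])

# The same slope, obtained from the correlation coefficient.
slope_from_r <- cor(x, y) * sd(y) / sd(x)

all.equal(slope_lm, slope_from_r)       # TRUE
sign(slope_lm) == sign(cor(x, y))       # TRUE: both are negative here
```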
Theo, this is really thorough and I appreciate how much attention you must have paid to what you were doing. Great job, everything looks about right and shows that you are strongly on track with where we're at. I'm excited to see what you're able to do once you really have some powerful tools and techniques for making the data speak even more interestingly. Remember that you want to think creatively and analyze data creatively but then write about the data really conservatively, or only as strongly as the data really let you. But great job so far...